
strand forming an initial DNA–RNA hybrid from which the new mRNA transcript is separated. The purpose of this exercise is to investigate the promoter regions of the genes targeted by the DNA-directed RNA polymerase II subunit RPB1 during gene transcription. The raw data consist of four single-end FASTQ files generated on an Illumina Genome Analyzer and available from the ENCODE database: three ChIP samples with the accession numbers ENCFF000XJP, ENCFF000XJS, and ENCFF000XKD, and one input (control) sample with the accession number ENCSR000EZM, whose FASTQ file is downloaded below as ENCFF000XGP. To keep the files organized, we can create a project directory called “chipseq” and, inside it, a subdirectory called “data” into which we download the FASTQ files as follows:

mkdir chipseq; cd chipseq; mkdir data

wget \
  -O "data/ENCFF000XJP_chp1.fastq.gz" \
  "https://www.encodeproject.org/files/ENCFF000XJP/@@download/ENCFF000XJP.fastq.gz"

wget \
  -O "data/ENCFF000XJS_chp2.fastq.gz" \
  "https://www.encodeproject.org/files/ENCFF000XJS/@@download/ENCFF000XJS.fastq.gz"

wget \
  -O "data/ENCFF000XKD_chp3.fastq.gz" \
  "https://www.encodeproject.org/files/ENCFF000XKD/@@download/ENCFF000XKD.fastq.gz"

wget \
  -O "data/ENCFF000XGP_inp0.fastq.gz" \
  "https://www.encodeproject.org/files/ENCFF000XGP/@@download/ENCFF000XGP.fastq.gz"
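The four downloads share one URL and file-naming pattern, so they can also be written as a loop. The following is a minimal sketch, not part of the original workflow; the function names encode_url and download_all are hypothetical helpers introduced here for illustration, and the sketch assumes it is run from inside the “chipseq” directory.

```shell
# Hypothetical helper: build the ENCODE download URL for a file accession,
# following the URL pattern used in the wget commands above.
encode_url() {
  printf 'https://www.encodeproject.org/files/%s/@@download/%s.fastq.gz\n' "$1" "$1"
}

# Hypothetical helper: download every file into data/ using the same
# <accession>_<label>.fastq.gz naming scheme as above.
download_all() {
  mkdir -p data
  # Each entry is "accession:label" (chp = ChIP sample, inp = input/control).
  for entry in ENCFF000XJP:chp1 ENCFF000XJS:chp2 ENCFF000XKD:chp3 ENCFF000XGP:inp0; do
    acc=${entry%%:*}      # text before the colon
    label=${entry##*:}    # text after the colon
    wget -O "data/${acc}_${label}.fastq.gz" "$(encode_url "$acc")"
  done
}
```

Calling download_all after defining both functions should fetch the same four files as the individual commands.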

The four files will be downloaded into the “data” directory as ENCFF000XJP_chp1.fastq.gz, ENCFF000XJS_chp2.fastq.gz, ENCFF000XKD_chp3.fastq.gz, and ENCFF000XGP_inp0.fastq.gz. The last of these is the FASTQ file that contains the input (control) data.
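Because an interrupted transfer can leave a truncated file, it may be worth confirming that each download is an intact gzip archive before going further. The sketch below is an optional check, not a step from the original workflow; count_reads and check_fastq are hypothetical helper names. It relies on the fact that each FASTQ record occupies four lines.

```shell
# Hypothetical helper: count the reads in a gzip-compressed FASTQ file.
# A FASTQ record spans four lines, so reads = lines / 4.
count_reads() {
  lines=$(gzip -cd "$1" | wc -l)
  echo $((lines / 4))
}

# Hypothetical helper: test the gzip integrity of each file given as an
# argument and, if the archive is intact, report its read count.
check_fastq() {
  for f in "$@"; do
    if gzip -t "$f" 2>/dev/null; then
      echo "$f: OK, $(count_reads "$f") reads"
    else
      echo "$f: corrupt or incomplete download" >&2
    fi
  done
}
```

From the “chipseq” directory, the check could be run as check_fastq data/*.fastq.gz; any file reported as corrupt should be downloaded again.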

6.3.2  Quality Control

Quality control is an important step in all sequencing data analysis workflows. The quality of the reads in a FASTQ file can be assessed with an appropriate program such as FastQC, which checks the read quality, technical sequences such as adaptor dimers and PCR duplicate reads, GC-content bias, and other sequencing biases. We should fix as many potential problems as possible before proceeding to the mapping step. Refer to Chapter 1 for the quality assessment metrics and the approaches to fixing potential faults.

cd data

fastqc \

ENCFF000XJP_chp1.fastq.gz \

ENCFF000XJS_chp2.fastq.gz \